Welcome to my 5th homework assignment. This assignment has 4 parts and is about factor and figure management.
For this part I will drop factors/levels and reorder levels based on knowledge from data.
My plan is to: 1) Drop Asia from the gapminder data and show proof. 2) Reorder the levels of country
First, I will load all the packages that this assignment requires
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(kableExtra))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(scales))
I should mention that I will be using the forcats package for this part. This package is within the tidyverse package :smile:
Let’s check out the dataset!
#view the top 6 rows of the gapminder dataset
head(gapminder)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
Next, I will get to know my factor before I start to work with it
#display the internal structure of an R object
str(gapminder$continent)
## Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
#access to the levels attribute of a variable
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
#how many levels there are
nlevels(gapminder$continent)
## [1] 5
#class of the factor
class(gapminder$continent)
## [1] "factor"
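Another quick way to get to know a factor is to count the rows in each level. This is a minimal sketch using fct_count() from forcats; the name continent_counts is just something I made up for illustration.

```r
library(gapminder)
library(forcats)

# One row per level of continent, with the number of rows in that level
continent_counts <- fct_count(gapminder$continent)
continent_counts
```

Passing sort = TRUE to fct_count() should list the most common levels first.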
First, I will filter Asia out of my dataset into gap_no_Asia
#remove Asia from dataset
gap_no_Asia <- gapminder %>%
filter(continent != "Asia")
Let’s see if this worked
#How many rows are in the original dataset
nrow(gapminder)
## [1] 1704
#How many rows are in the new dataset
nrow(gap_no_Asia)
## [1] 1308
This makes sense! gap_no_Asia has 396 fewer rows than the original gapminder dataset (1704 - 1308), which is the number of rows for Asia.
However, Asia is still showing up as a continent
gap_no_Asia$continent %>%
levels
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
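The leftover level can also be seen in a frequency table. This is a small sketch in base R (no_asia is a hypothetical name, separate from the objects above): the Asia rows are gone, but the level is still attached to the factor.

```r
library(gapminder)

# Same filtering step as above, written with base R subsetting
no_asia <- gapminder[gapminder$continent != "Asia", ]

# Number of rows that were removed
nrow(gapminder) - nrow(no_asia)

# Asia is still listed as a level, but no rows belong to it
table(no_asia$continent)
```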
It is necessary now to drop levels that are unused
#Saving the variable with the unused levels dropped
gap_no_Asia_drop <- gap_no_Asia %>%
droplevels()
Let’s do another check to see what continents are in gap_no_Asia_drop
gap_no_Asia_drop$continent %>%
levels
## [1] "Africa" "Americas" "Europe" "Oceania"
We have officially dropped Asia from the gapminder dataset, wahoo!
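droplevels() cleans up every factor in the data frame at once. If only one factor should be touched, fct_drop() from forcats is an option. Here is a sketch on a small made-up data frame (toy, region, and the values are all hypothetical, not from gapminder):

```r
library(forcats)

# Toy data with two factors
toy <- data.frame(
  continent = factor(c("Africa", "Asia", "Europe", "Asia")),
  region    = factor(c("a", "b", "c", "b"))
)

# Remove the Asia rows; both factors still carry all of their levels
no_asia <- toy[toy$continent != "Asia", ]

# Drop unused levels from continent only
no_asia$continent <- fct_drop(no_asia$continent)

levels(no_asia$continent)  # "Africa" "Europe"
levels(no_asia$region)     # "a" "b" "c" (the unused "b" is kept)
```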
Next, I will reorder the levels of country by life expectancy for 2007. I will do this for Europe only to keep the data set simple. Currently, the levels are ordered alphabetically.
#filtering for the continent Europe in 2007
gap_eur_2007 <- gapminder %>%
filter(year == 2007, continent == ("Europe"))
#showing my new data set
head(gap_eur_2007)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 2007 76.4 3600523 5937.
## 2 Austria Europe 2007 79.8 8199783 36126.
## 3 Belgium Europe 2007 79.4 10392226 33693.
## 4 Bosnia and Herzegovina Europe 2007 74.9 4552198 7446.
## 5 Bulgaria Europe 2007 73.0 7322858 10681.
## 6 Croatia Europe 2007 75.7 4493312 14619.
Reordering the data forwards and backwards (ascending and descending)
# Ordering European countries by maximum life expectancy
gap_eur_2007_reorder <- gap_eur_2007 %>%
mutate(country = fct_reorder(country, lifeExp, max))
#viewing the data: head() looks the same because fct_reorder() changes the level order, not the row order
head(gap_eur_2007_reorder)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Albania Europe 2007 76.4 3600523 5937.
## 2 Austria Europe 2007 79.8 8199783 36126.
## 3 Belgium Europe 2007 79.4 10392226 33693.
## 4 Bosnia and Herzegovina Europe 2007 74.9 4552198 7446.
## 5 Bulgaria Europe 2007 73.0 7322858 10681.
## 6 Croatia Europe 2007 75.7 4493312 14619.
gap_eur_2007_reorder_backwards <- gap_eur_2007 %>%
mutate(country = fct_reorder(country, lifeExp, max, .desc = TRUE))
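Since head() prints the rows in the same order before and after, the change is easier to see by looking at the levels directly. This is a minimal sketch on made-up data (the country names A, B, C and the object names are hypothetical):

```r
library(forcats)

# Toy data: three countries with one life expectancy value each
toy <- data.frame(
  country = factor(c("A", "B", "C")),
  lifeExp = c(76.4, 79.8, 73.0)
)

ascending  <- fct_reorder(toy$country, toy$lifeExp, max)
descending <- fct_reorder(toy$country, toy$lifeExp, max, .desc = TRUE)

levels(toy$country)      # "A" "B" "C" (alphabetical)
levels(ascending)        # "C" "A" "B" (lowest to highest lifeExp)
levels(descending)       # "B" "A" "C" (highest to lowest lifeExp)
as.character(ascending)  # "A" "B" "C" (the values stay in their original positions)
```

The level order is what ggplot2 uses to place a factor on an axis, which is why the figures change even though the rows do not.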
Let’s check if this really worked. I will plot a figure for gap_eur_2007, gap_eur_2007_reorder, and gap_eur_2007_reorder_backwards to see if my reordering made a difference.
#original figure
gap_eur_2007 %>%
ggplot(aes(lifeExp, country)) +
geom_point() +
theme_classic() +
ggtitle("Life expectancy of European countries in 2007 (no reordering)")
#reordered figure by maximum life expectancy
gap_eur_2007_reorder %>%
ggplot(aes(lifeExp, country)) +
geom_point() +
theme_classic() +
ggtitle("Max life expectancy of European countries in 2007 (ascending)")
#figure with order descending
gap_eur_2007_reorder_backwards %>%
ggplot(aes(lifeExp, country)) +
geom_point() +
theme_classic() +
ggtitle("Max life expectancy of European countries in 2007 (descending)")
It is clear that reordering your data can be extremely helpful for viewing trends in figures. Before I reordered the data, it was almost impossible to see a trend in the first figure. Once the levels are reordered, it is very easy to identify the order of countries in terms of maximum life expectancy.
For this part I will experiment with changing, reading, and writing files.
My plan is to: 1) Create a new dataset 2) Write the data set to a csv file 3) Read the data set from the csv file 4) Visualize the data before and after reading the csv
I will create a dataset from gapminder that is of countries in Asia, sorted by population in 2007.
#creating the data frame
gap_Asia_2000_pop <- gapminder %>%
filter(year == 2007, continent == ("Asia")) %>%
arrange(pop) %>%
droplevels()
#writing the data frame
write.csv(gap_Asia_2000_pop, file = "gap_Asia_2000_pop.csv")
#reading the data frame
gap_Asia_2000_pop_READ <- read.csv("gap_Asia_2000_pop.csv")
Let’s see if gap_Asia_2000_pop can keep its integrity during the writing and reading process.
Before writing:
head(gap_Asia_2000_pop) %>%
knitr::kable(caption = "Table before writing the file")
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Bahrain | Asia | 2007 | 75.635 | 708573 | 29796.048 |
| Kuwait | Asia | 2007 | 77.588 | 2505559 | 47306.990 |
| Mongolia | Asia | 2007 | 66.803 | 2874127 | 3095.772 |
| Oman | Asia | 2007 | 75.640 | 3204897 | 22316.193 |
| Lebanon | Asia | 2007 | 71.993 | 3921278 | 10461.059 |
| West Bank and Gaza | Asia | 2007 | 73.422 | 4018332 | 3025.350 |
Reading the file:
head(gap_Asia_2000_pop_READ) %>%
knitr::kable(caption = "Table from reading the written file")
| X | country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|---|
| 1 | Bahrain | Asia | 2007 | 75.635 | 708573 | 29796.048 |
| 2 | Kuwait | Asia | 2007 | 77.588 | 2505559 | 47306.990 |
| 3 | Mongolia | Asia | 2007 | 66.803 | 2874127 | 3095.772 |
| 4 | Oman | Asia | 2007 | 75.640 | 3204897 | 22316.193 |
| 5 | Lebanon | Asia | 2007 | 71.993 | 3921278 | 10461.059 |
| 6 | West Bank and Gaza | Asia | 2007 | 73.422 | 4018332 | 3025.350 |
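It looks like the data frame survived the writing and reading process. The rows did not default to alphabetical order; they retained their ordering by population. The one difference is the extra X column, which holds the row names that write.csv() saves by default.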
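To look at the round trip a bit more closely, here is a sketch on a small made-up data frame written to a temporary file (toy, path, and back are hypothetical names). It assumes row.names = FALSE is wanted to avoid the extra column, and it reads strings back as factors to show what happens to the level order.

```r
# Toy data, sorted by population, with country levels in that same order
toy <- data.frame(
  country = c("Oman", "Lebanon", "Bahrain"),
  pop     = c(3204897, 3921278, 708573)
)
toy <- toy[order(toy$pop), ]
toy$country <- factor(toy$country, levels = toy$country)

# Write without row names, then read back with strings as factors
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = TRUE)

names(back)                 # "country" "pop" (no X column)
as.character(back$country)  # "Bahrain" "Oman" "Lebanon" (row order kept)
levels(back$country)        # "Bahrain" "Lebanon" "Oman" (levels back to alphabetical)
```

So the row order survives a csv file, but a custom level order does not, because a csv only stores the text. Re-applying factor() with explicit levels after reading is one way to restore it.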
For this part I will create a figure with what I learnt in recent classes in mind. I will use a figure from my 2nd homework assignment (yikes) and recreate it.
My plan is to: 1) Show and recreate a figure from my first homework assignment, with an explanation of why it is better. 2) Recreate this figure using plotly, and explain the benefits of using plotly.
Here is the first figure I made for a homework assignment in this class:
ggplot(gapminder, aes(continent, lifeExp)) +
geom_point()
Wow… that’s hard to look at after the course :smile: I guess it isn’t THAT bad but it can definitely use some work.
I will recreate this figure with some updates: 1) new x and y axis labels 2) a figure title 3) a colour scheme 4) a simple theme
ggplot(gapminder, aes(continent, lifeExp, colour=lifeExp)) +
geom_point() +
theme_classic() +
ggtitle("Life expectancy by continent") +
xlab("Continent") +
ylab("Life expectancy") +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
For the plotly version, I’m going to use the gap_eur_2007_reorder data frame (from part 1) because I find it more interesting. I will make a simple figure here so you can see what it looks like before converting it to plotly:
Plot_normal <- ggplot(gap_eur_2007_reorder, aes(lifeExp, country, colour=country)) +
geom_point() +
geom_smooth(method = loess) +
theme_classic() +
ggtitle("Life expectancy by European country in 2007") +
#x is life expectancy and y is country, so the axis labels follow that order
xlab("Life expectancy") +
ylab("Country") +
theme(legend.position = "none")
Plot_normal
Now, I will convert this figure to plotly. One of the biggest benefits of plotly is that you can put your cursor over any point and see what the value is. This is especially useful when there are many y-values, like there are in my figure. It can also zoom in and out, download the plot as .png, and show statistical summaries.
#creating plotly
ggplotly(Plot_normal)
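The hover text can also be trimmed down. This is a small sketch, assuming only the x and y values are wanted on hover; it uses a made-up data frame (toy) rather than gap_eur_2007_reorder so that it runs on its own.

```r
library(ggplot2)
library(plotly)

# Toy data standing in for the European countries
toy <- data.frame(
  country = c("Albania", "Austria", "Belgium"),
  lifeExp = c(76.4, 79.8, 79.4)
)

p <- ggplot(toy, aes(lifeExp, country)) +
  geom_point()

# tooltip chooses which aesthetics are shown when hovering over a point
interactive <- ggplotly(p, tooltip = c("x", "y"))
```

Printing interactive in an R Markdown chunk should render the widget when knitting to HTML.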
For this part I will use ggsave() to save Plot_normal to a file. Then I will use ![Alt text](/path/to/img.png) to load and embed it in my report.
#also specifying the scale and dpi of the figure
ggsave(filename = "KZ_Plot_normal.png", Plot_normal, dpi = 100, scale = 1.5)
## Saving 10.5 x 7.5 in image
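The size in that message depends on the graphics device and on scale; 10.5 x 7.5 in is what a 7 x 5 in size multiplied by scale = 1.5 would give. To make the saved figure reproducible, the width and height can be set explicitly. A minimal sketch, with a made-up plot and a temporary file name (toy, p, and out are hypothetical):

```r
library(ggplot2)

# Toy plot standing in for Plot_normal
toy <- data.frame(x = 1:3, y = c(2, 4, 3))
p <- ggplot(toy, aes(x, y)) +
  geom_point()

# Explicit size: 7 x 5 inches at 100 dpi, so roughly 700 x 500 pixels
out <- file.path(tempdir(), "toy_plot.png")
ggsave(filename = out, plot = p, width = 7, height = 5, units = "in", dpi = 100)

file.exists(out)
```

Naming the plot argument (plot = p) also avoids saving the wrong figure if another plot was displayed more recently.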
Figure